Abstract. Spoken language technology is one of the domains in which machine learning
algorithms, and especially neural networks, are widely used today. Some applications
are presented in this paper: detecting overlapped speech on short time frames (down to
25 ms), emotion recognition from speech (including speech stress detection and deceptive
speech detection) and the performance of the latest large vocabulary continuous speech
recognition systems for Romanian developed in the SpeeD Laboratory of the Research
Institute “CAMPUS”, University POLITEHNICA of Bucharest.
1. Detecting overlapped speech on short timeframes using deep
learning


In several presentations in the Information Science and Technology 
Section of the Academy of Romanian Scientists I pointed out some of the main 
research directions for the Speech and Dialogue (“SpeeD”) team. Now I am able
to give more details about some achievements in several areas of interest:
detecting overlapped speech on short timeframes, emotion recognition from
speech, new approaches to Romanian speech and speaker recognition. What do
these seemingly very different areas have in common? 

I am trying to demonstrate that the methods offered by machine learning 
could provide viable solutions for the most diverse applications. But it is also an 
opportunity to share some of the achievements of the team I am working with. 


Long speech frames, i.e. more than 500 ms, have a higher probability of 
containing partially overlapped speech (e.g. one speaker produces an utterance 
200 ms after another one has started). This risks lowering accuracy in
blind speech source separation (BSS), speaker identification and crowd-sensing.
Detecting overlapped speech on short timeframes can contribute to key BSS 
applications. 
In some previous papers [13], [14] we presented several methods for
counting competing speakers and compared them with the human ability to
count the speakers in overlapped speech recorded on a single channel.

Figure 1 shows that if there are more than 4 simultaneous speakers in
a single channel recording, human listeners have serious difficulties in counting 
them. Therefore, we limited our overlapped speech detection study to up to 3 
simultaneous speakers. 


[Figure 1: % of wrong answers when asked "How many speakers did you detect in the recording?"]
Figure 1. Number of competing speakers detected by human listeners
The setup of our experiments for training and inference uses up to 3
simultaneous speakers per mixture. The mixer normalizes the generated samples.
We generated 100k mixtures for training and 20k mixtures for inference. Speech
source mixing needs to label each timeframe accurately, so we only select
timeframes with full overlapping (Figure 2).
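
The mixing step described above can be illustrated with a minimal sketch. It assumes each source is a mono frame stored as a NumPy array, sums up to three of them and normalizes the result. The helper name `mix_sources`, the choice of peak normalization, the 16 kHz sampling rate and the use of the speaker count as label are assumptions made for illustration; the text only states that the mixer normalizes the generated samples.

```python
import numpy as np

def mix_sources(sources, max_speakers=3):
    """Sum fully overlapping speech frames and peak-normalize the mixture.

    Hypothetical helper: peak normalization is an assumption, not a
    detail given in the paper.
    """
    if not 1 <= len(sources) <= max_speakers:
        raise ValueError("expected between 1 and max_speakers sources")
    length = min(len(s) for s in sources)       # keep only the fully overlapped part
    mixture = np.sum([np.asarray(s[:length], dtype=float) for s in sources], axis=0)
    peak = np.max(np.abs(mixture))
    if peak > 0:
        mixture = mixture / peak                # scale into [-1, 1]
    label = len(sources)                        # number of simultaneous speakers
    return mixture, label

rng = np.random.default_rng(0)
frame_len = 400                                 # 25 ms at an assumed 16 kHz sampling rate
speakers = [rng.standard_normal(frame_len) for _ in range(3)]  # stand-ins for speech
mixture, label = mix_sources(speakers)
```

Truncating to the shortest source is one simple way to guarantee that every retained sample is fully overlapped, so the label holds for the whole frame.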


1.1. Feature set selection 


Providing unprocessed input to deep neural networks yields satisfactory 
results for image analysis applications, but for speech processing, feature 
engineering is still important. 

So we used several extracted feature sets, normalized (i.e. brought to similar
numerical ranges) to speed up convergence:

o Signal’s frequency spectrum. Nowadays this counts as the “raw / unprocessed” input.
It contains the highest amount of information from which the deep neural network can
create features.

o MFCC coefficients, a dominant feature in prior work. We investigated the
use of first and second order derived coefficients with no improvement in 
accuracy. 

o Signal envelope computed with the Hilbert transform. An overlapped speech
sample tends to have a flat shape and we expect the neural network to detect
this.

Figure 3. Deep Neural Network (DNN) architecture

o AR (autoregressive model) coefficients produce a wide numerical range that
was observed to amplify subtle numerical differences between overlapped and 
non overlapped speech. 

o We provide the option of adding squared features for all the components 
which can add extra performance when decision boundaries are very complex. 
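
Three of the ingredients listed above (the Hilbert envelope, normalization to a common range, and the optional squared features) can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: the function names, zero-mean/unit-variance scaling as the normalization, the 16 kHz sampling rate and the test signal are all assumptions.

```python
import numpy as np

def hilbert_envelope(x):
    """Envelope as the magnitude of the analytic signal (FFT-based Hilbert transform)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                        # keep the DC bin
    if n % 2 == 0:
        weights[n // 2] = 1.0               # keep the Nyquist bin
        weights[1:n // 2] = 2.0             # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

def normalize(features):
    """Zero-mean, unit-variance scaling (assumed form of the normalization)."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean()
    std = features.std()
    return centered / std if std > 0 else centered

def add_squared(features):
    """Append the element-wise squares of all components."""
    features = np.asarray(features, dtype=float)
    return np.concatenate([features, features ** 2])

t = np.arange(400) / 16000.0                          # 25 ms frame at an assumed 16 kHz
modulator = 1.0 + 0.5 * np.cos(2 * np.pi * 80.0 * t)  # slow amplitude modulation
signal = modulator * np.cos(2 * np.pi * 1000.0 * t)   # 1 kHz carrier (synthetic test signal)
envelope = hilbert_envelope(signal)                   # recovers the modulator
vector = add_squared(normalize(envelope))             # 400 features + 400 squares
```

For this synthetic signal the envelope follows the slow modulator rather than the carrier, which is the property the envelope feature relies on.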


1.2. Deep Neural Network (DNN) architecture 


We used convolutional layers as the first layers due to their accepted ability to
create new feature sets [1], [2]. We then considered that densely connected layers
following the convolutional layers are enough for the network to analyze the features
(Figure 3).


Experiments were done in order to determine the optimal combination of 
DNN parameters (Table 1). 


Design Parameter                        Range      Optimal
Number of convolutional layers          3-4        4
Number of 1D filters per conv. layer    5-30       20
Filter size on conv. layers             5-15       10
Number of densely connected layers      3-20       6
Units in dense layers / input size      1.1-2.0    1.5

Table 1. DNN parameters


For the DNN training we used TensorFlow to build the entire experimental
infrastructure. Stochastic Gradient Descent was selected as the model weights
update method, using the Momentum optimizer implemented in TensorFlow with
Nesterov Accelerated Gradient activated.
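
To make the update rule concrete, here is a minimal sketch of one common formulation of momentum SGD with Nesterov acceleration, applied to a toy quadratic. It is written in plain Python rather than TensorFlow. The function name, the toy objective and the learning rate of 0.1 are assumptions for illustration only; the momentum of 0.9 is the optimal value from Table 2.

```python
def nesterov_step(w, velocity, grad, lr, momentum=0.9):
    """One Nesterov-accelerated momentum update (one common formulation)."""
    velocity = momentum * velocity + grad        # accumulate past gradients
    w = w - lr * (grad + momentum * velocity)    # look-ahead correction
    return w, velocity

# Toy problem (an assumption for illustration): minimize f(w) = w^2.
w, v = 1.0, 0.0
for _ in range(200):
    grad = 2.0 * w                               # derivative of w^2
    w, v = nesterov_step(w, v, grad, lr=0.1)
```

On this toy problem the iterate oscillates slightly and settles at the minimum w = 0.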

For the hyperparameters we considered some usual ranges as shown in 
Table 2. 


Parameter                     Range            Optimal
Learning rate                 10^-4 - 10^-7    10^[illegible]
Momentum                      0.8 - 0.95       0.9
Batch Size                    32 - 800         430
Learning rate decay rate      0.9 - 0.99       0.99
Learning rate decay epochs    10 - 50          20


Table 2. DNN hyperparameters
The learning rate was decayed across epochs to ensure convergence
towards the end of the training, when the weights need to be updated in small
steps. Surprisingly, convergence was achieved even with such small learning rate
values. Batch size depends on multiple factors, the most important being
memory usage. Batch size and the other hyperparameters are
interdependent.
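
The decay schedule can be sketched as a simple exponential rule using the decay rate (0.99) and decay period (20 epochs) reported as optimal in Table 2. The staircase form, the function name and the initial learning rate of 1e-5 are assumptions; the paper does not state which decay variant was used, and the 1e-5 value is only a placeholder.

```python
def decayed_learning_rate(initial_lr, epoch, decay_rate=0.99, decay_epochs=20,
                          staircase=True):
    """Exponential learning rate decay.

    With staircase=True the rate drops once every `decay_epochs` epochs;
    otherwise it decays continuously. Which variant the authors used is
    not stated, so both are offered.
    """
    exponent = epoch // decay_epochs if staircase else epoch / decay_epochs
    return initial_lr * decay_rate ** exponent

initial_lr = 1e-5   # placeholder value, not taken from the paper
schedule = [decayed_learning_rate(initial_lr, e) for e in (0, 19, 20, 40, 200)]
```

With these settings the rate shrinks by only 1% every 20 epochs, so the step size changes very gradually over training.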


The DNN architecture parameter ranges were selected intuitively, based on
the size of state-of-the-art models used in speech analysis (e.g. Deep Speech and Deep
Speech 2). The optimal architecture for our dataset was determined after
experimenting with multiple combinations.

Hyperparameter ranges were identified after several sampling steps, and
some conclusions can be summarized: learning rate was the key tracked
parameter; batch size was limited by the memory capacity of the system; the rest
of the parameters were selected intuitively, based on widely known architectures.


1.3. Results 


We analyzed how the frame length affects the accuracy of overlapped
speech detection (the frame length matters for adoption in various
applications: e.g. longer frame lengths may be suitable for crowd-sensing while
short frame lengths can help BSS). The features selected for each frame length
are presented in Table 3 and the detection performance under various measures
is presented in Table 4.


Frame length   FFT   MFCC   AR    Envelope   Sq. Feat.
500 ms         NO    YES    NO    NO         NO
100 ms         NO    YES    YES   NO         YES
25 ms          YES   YES    YES   YES        YES


Table 3. Feature selection per targeted case


Frame Length   Detection Accuracy   F-Score   Precision   Recall
500 ms         80.2%                0.8       0.81        0.78
100 ms         79%                  0.78      0.82        0.74
25 ms          74.2%                0.72      0.77        0.68


Table 4. Detection performance
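
As a consistency check on Table 4, the F-Score can be recomputed from precision and recall as their harmonic mean, F = 2PR / (P + R). The sketch below uses the precision and recall values from the table; the function and variable names are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs per frame length, as reported in Table 4
table4 = {"500ms": (0.81, 0.78), "100ms": (0.82, 0.74), "25ms": (0.77, 0.68)}
recomputed = {frame: round(f_score(p, r), 2) for frame, (p, r) in table4.items()}
```

The recomputed values agree with the reported F-Scores to within rounding of the published precision and recall figures.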


Several conclusions can be drawn from the analysis of the results:
o Longer frame lengths show improved accuracy because they contain much
more information.

o For 500 and 100 ms frame durations, we could not use raw inputs (e.g.
frequency spectrum) because the training duration grows exponentially.

o In order to obtain reasonable accuracy for short time frames we needed to add
features.

o We speculate that additional features may improve the F-Score of the
detection.

o Overlapped speech detection can be achieved successfully with the proposed
DNN architecture composed of convolutional layers and a multi-layer
perceptron.

o We obtained an F-Score of 0.72 for 25 ms timeframes. In the existing literature,
the highest F-Score collected in circumstances similar to our experiment
was reported as 0.63.

o Longer frames may yield better accuracy, but they may limit the
applications where this method can be adopted; crowd-sensing is one
application where they remain suitable.


2. Emotion Recognition from Speech 


The broad framework of this topic is dissimulated behavior
monitoring. We considered the following tasks: Speech Emotion Recognition
(SER), Speech Stress Detection (SSD) and Deceptive Speech Detection (DSD).
The target applications are in forensics (questionings, interviews etc.),
surveillance (suspicious behavior), medical care (monitoring / prevention of stress,
anxiety, depression) etc.


2.1. Classifications and main characteristics 


The important features of every task are:
For Speech Emotion Recognition (SER):
a) Discrete categories (classification):
o each emotion is a separate class, with its own characteristics;
o typical system: 4-7 emotional classes;
o additional interest: just negative emotions (reduced set); any emotion vs.
neutral (binary); each emotion vs. its absence (binary).
b) Dimensional models (continuous; regression; Figure 4):
o several properties (dimensions) determine an “affective space” in which
each emotion class is defined by a certain sub-space;
o typical system: 2D space (plane) (arousal & valence);
o two possible annotations: a single value pair for every file in a database
(global annotation) OR quasi-continuous annotation for every short time
frame (sequential annotation).
Figure 4. Dimensional model for SER


For Speech Stress Detection (SSD):

Discrete categories (classification):
o various types of stress (associated with speaking style, ambient conditions,
problem solving etc.);
o typical system: 3-6 classes (out of a pool of 11-16);
o additional interest: any type vs. neutral (binary); each type vs. its absence (binary);
o additional approach (indirect classification): using a proxy for stress, e.g. fear
(due to scarcity of available direct data).

For Deceptive Speech Detection (DSD):

Discrete categories (classification):
o simple binary classification: deceptive (untruthful) vs. non-deceptive (truthful);
o main interest: global untruthfulness (annotations available for subjects lying at
the utterance / phrase / turn level, even if parts of their speech are truthful);
o alternative: local untruthfulness (annotations available for subjects lying at the
segment / short phrase level).


2.2. Databases


Building an annotated database for SER, SSD or DSD is a very difficult
task, taking into account some particularities:

o simulated data, using actors or amateur speakers (e.g. students), are often of
little relevance for real-life situations;

o it is very difficult to control the environment / scenario because predictability
could mean lack of spontaneity;

o subjective data annotation is the most important drawback. Because of this
poor objectivity we have to deal with relatively unreliable data.


So, we investigated the following open-source databases:


a) SER, discrete categories:

o EMODB (Berlin Database of Emotional Speech): German; 10 speakers; 535
recs.; 7 classes (e.g. Anger, Disgust, Fear, Sadness etc.);

o IEMOCAP (Interactive Emotional Dyadic Motion Capture Database):
English; 10 speakers; 10039 recs.; 6 usable classes (e.g. Anger, Sadness etc.);

o CREMAD (Crowd-sourced Emotional Multimodal Actors Dataset): English;
91 speakers; 7442 recs.; 6 classes (e.g. Anger, Disgust, Fear, Sadness etc.).


b) SER, dimensional models:

o IEMOCAP (“IEMOCAP 2”): secondary annotation for IEMOCAP; 3
dimensions (arousal, valence, dominance);

o RECOLA (Remote Collaborative and Affective Interaction): French; 46
speakers; 23 long recs. (~3500 utterances); 2 dimensions (arousal, valence).


c) SSD:

o SUSAS (Speech Under Simulated and Actual Stress): English; 16 speakers;
14600 short recordings (35 keywords); 16 possible classes (various subsets of
interest);

o SAFE (Situation Analysis in a Fictional and Emotional Corpus):
English/French; 400 speakers; 400 medium recs. (~12000 utterances); 2
classes of interest (Fear and Neutral).


d) DSD:

o RLDD (Real-Life Trial Data for Deception Detection): English; 56 speakers;
121 medium recs. (~1800 utterances);

o RODeCAR (Romanian Deva Criminal Investigation Audio Recordings):
Romanian; 19 speakers; 25 very long recs. (~13500 utterances); authentic and
reliable data (non-subjective annotation, high-stakes contexts etc.). This is the
most important resource because it was annotated by our team on reliable,
genuine utterances obtained in real-life criminal investigations.


We used hand-crafted or automatically extracted features, such as acoustic and
prosodic features: pitch, jitter, shimmer, Mel-frequency cepstral coefficients
(MFCC), loudness, low/high frequency spectral concentration etc., and additional
modulation-based features derived from the instantaneous amplitude and
frequency (speech as a series of AM-FM micro-modulations).


2.3. Deep Learning Models 


We investigated several types of neural networks in order to find the best
solution for our targets:
a) Multilayer perceptrons (MLPs): used for the classification tasks (SER,
SSD, DSD) and the global regression task (SER with arousal-valence
utterance-level annotation):
• Standalone: offers good results (similar to or better than the current state of the art).
• Ensemble classification: inspired by Support Vector Machine (SVM)
multiclass approaches: One-vs-One (OvO) and One-vs-Rest (OvR).
b) Bidirectional recurrent neural networks (RNNs) with long short-term
memory (LSTM) cells (Figure 5):
• used for all tasks, including the sequential regression task (SER with
arousal-valence quasi-continuous annotation);
• used standalone.


Figure 5. Bidirectional RNNs with LSTM cells


c) Stacked autoencoders (SAEs, Figure 6):
• used for the classification tasks (SER, SSD, DSD) and the global regression
task (SER with arousal-valence utterance-level annotations);
• used standalone;
• using MLPs as autoencoders.


2.4. Experimental Results 


Several performance metrics were used in order to compare the
performance of the SER systems:
• WA = weighted accuracy (ratio of all correct predictions vs. total no. of
samples).

• UA = unweighted accuracy (average class recall, i.e. mean of per-class
accuracies; more relevant for unbalanced classes).

• PCC = Pearson correlation coefficient (linear correlation).

• CCC = concordance correlation coefficient (PCC adjusted to also consider
the prediction bias; more relevant for potentially mean-shifted data).
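
The four metrics can be written down in a few lines of NumPy. The sketch below follows the definitions above; the function names and the toy inputs are illustrative, and the CCC uses the usual form 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) with population statistics.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: correct predictions over total number of samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of the per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def pcc(x, y):
    """Pearson correlation coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def ccc(x, y):
    """Concordance correlation coefficient (penalizes mean and scale shifts)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2))

# Unbalanced toy labels: always predicting the majority class
wa = weighted_accuracy([0, 0, 0, 1], [0, 0, 0, 0])   # 0.75
ua = unweighted_accuracy([0, 0, 0, 1], [0, 0, 0, 0]) # 0.5
# Predictions shifted by a constant bias of +1
r = pcc([1, 2, 3], [2, 3, 4])                        # 1.0 up to rounding
c = ccc([1, 2, 3], [2, 3, 4])                        # 4/7, about 0.571
```

The two toy cases show why UA and CCC are the stricter measures: a majority-class predictor keeps a WA of 0.75 but drops to a UA of 0.5, and a constant prediction bias leaves PCC at 1 while CCC falls to about 0.57.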


Figure 6. Stacked autoencoders 


Table 5 presents the performance reported in some articles (“Article”, numbered as
they appear in the “References” section) compared with the results obtained by
our team (SpeeD: Speech and Dialogue Research Laboratory), for different
databases. The acronyms for the emotion states are:

ANG: Anger 

HAP: Happy 

EXC: Excited 

SAD: Sadness 

NEU: Neutral 

FEA: Fear 

FRU: Frustrated 


“Nclass” stands for the number of classes to discriminate among emotions, and
the results are evaluated in terms of the different performance metrics presented
above.


2.5. Conclusions 


Current results are similar to or better than the state of the art for SER with
discrete categories on several databases, and we are continuing to fine-tune the
models in order to reach improved results on all considered datasets. We are still
working to improve performance for SER with dimensional models (global
annotation approach). The SSD and DSD tasks are still work in progress.


EMODB (Berlin Database of Emotional Speech):

Article   Nclass                           Best Results
[28]      7 (all)                          WA = 82.4%
[29]      4 (ANG, DIS, FEA, NEU)           WA = 84.3%
          2 (EMO vs. NEU)                  WA = 94.9%
[30]      7 (all)                          UA = 79.8%
[31]      7 (all)                          WA = 81.5%
[33]      7 (all)                          UA = 80.6%, WA = 81.6%
          5 (ANG, DIS, FEA, SAD, NEU)      UA = 87.3%, WA = 89.0%
SpeeD     7 (all)                          UA = 84.3%, WA = 84.4%
          5 (ANG, DIS, FEA, SAD, NEU)      UA = 93.2%, WA = 93.2%
          2 (EMO vs. NEU)                  UA = 98.3%, WA = 98.2%


IEMOCAP (Interactive Emotional Dyadic Motion Capture Database):

Article   Nclass                           Best Results
[22]      4 (ANG, HAP+EXC, SAD, NEU)       UA = 60.5%
[23]      [illegible]                      UA = 48.7%, WA = 57.1% (Audio)
[27]      4 (ANG, HAP, SAD, NEU)           UA = 58.8%, WA = 63.5% (LLDs)
SpeeD     [illegible]                      UA = 58.7%, WA = 61.6%


CREMAD (Crowd-sourced Emotional Multimodal Actors Dataset):

Article   Nclass                           Best Results
[24]      6 (all)                          WA = 57.2% (Audio)
[25]      6 (all)                          WA = 57.0% (Audio)
[26]      6 (all)                          WA = 41.5% (Audio)
SpeeD     6 (all)                          UA = 46.7%, WA = 51.6%
          5 (ANG, DIS, FEA, SAD, NEU)      UA = 50.8%, WA = 56.6%


IEMOCAP 2 (secondary annotation for IEMOCAP):

Article       Dimensions     Best Results
[illegible]   [illegible]    PCC = 0.797 (Arousal)
                             PCC = 0.566 (Valence)
[illegible]   [illegible]    PCC = 0.679, CCC = 0.637 (Arousal)
                             PCC = 0.382, CCC = 0.321 (Valence)


Table 5. Performances of several SER systems
3. DNN Approach to Romanian Speech Recognition
in “SpeeD” Laboratory


As I mentioned in some previous presentations in the Information Science
and Technology Section of the Academy of Romanian Scientists, one of the most
important activities of our team is to improve our results in the domain of spoken
language technology. Some progress in Automatic Speech Recognition (ASR)
systems is presented in this paper [34], [35], [36].


3.1. Unsupervised Speech Corpus Extension 


One assumption behind our approach is that different ASR systems make
complementary errors. So, a possible method to improve the performance of such a
system is to transcribe unlabeled audio using two different ASR systems, align the
transcriptions, keep only the identical parts and finally use the timestamps
provided by the ASR to cut the identical parts out of the original audio file.

The automatic annotation of unlabeled speech corpora, and the subsequent
improvement of the ASR systems by retraining on the new speech corpora,
are presented in the flowchart of the method (Figure 7).

There are plenty of characteristics that can be varied to build the two
complementary ASR systems: the acoustic model type, the vocabulary size, the decoding
language model complexity and/or the rescoring language model.

One important issue is aligning and filtering the transcriptions. We used Dynamic
Time Warping (DTW) to select the common parts; then long sequences of
consecutive words (those whose number of characters exceeds a given threshold) are
considered correctly transcribed, provided that the time interval between two
consecutive words excludes the existence of intermediate un-transcribed words.
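
The "keep only the identical parts" idea can be sketched with the Python standard library. Note that this sketch swaps in `difflib.SequenceMatcher` for the DTW alignment used in the paper, and it ignores timestamps. The function name, the example sentences and the threshold of 10 characters are assumptions for illustration.

```python
from difflib import SequenceMatcher

def common_segments(words_a, words_b, min_chars=10):
    """Return runs of consecutive words on which two transcriptions agree.

    A run is kept only if its total number of characters exceeds
    `min_chars` (hypothetical threshold). SequenceMatcher stands in for
    the DTW alignment described in the text.
    """
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        run = words_a[block.a:block.a + block.size]
        if sum(len(word) for word in run) > min_chars:
            segments.append(run)
    return segments

asr1 = "the cat sat on the mat today".split()   # output of the first system
asr2 = "the cat sat in the mat today".split()   # output of the second system
kept = common_segments(asr1, asr2)
```

Here the two hypotheses disagree on one word, which splits the agreement into two runs; only "the mat today" (11 characters) passes the threshold, while "the cat sat" (9 characters) is discarded as too short to be trusted.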

It is of great importance to establish a correct evaluation of the method. The
following performance figures are considered:

• Amount of speech selected after alignment relative to the total amount of
unlabeled speech.

• Annotation quality for the selected speech, measurable in word error rate
(WER) and character error rate (ChER); it can be computed using already
annotated speech (reference) along with word level timestamps.
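
Both error rates reduce to an edit distance between a reference and a hypothesis, computed over words for WER and over characters for ChER. The following is a minimal sketch of the standard Levenshtein-based definition; the function names and example strings are illustrative, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cher(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

word_rate = wer("a b c d", "a x c")        # one substitution + one deletion
char_rate = cher("kitten", "sitting")      # three character edits
```

Both rates are normalized by the reference length, so they can exceed 1 when the hypothesis contains many insertions.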

The speech corpora consist of:

• Read Speech Corpus (RSC): read and clean speech utterances in a silent
environment.

• Spontaneous Speech Corpus (SSC): spontaneous utterances from talk shows
and news broadcasts.
Purpose      Corpus      Duration       Total
Training     RSC-train   94 h, 46 m     225 h, 30 m
             SSC-train   130 h, 44 m
Evaluation   RSC-eval    5 h, 29 m      8 h, 58 m
             SSC-eval    3 h, 29 m
Annotation   Source #1   367 h, 57 m    777 h, 53 m
             Source #2   331 h, 44 m
             Source #3   78 h, 12 m


Table 6. Speech corpora for unsupervised speech annotation experiments


Unlabeled corpus: speech crawled from Romanian online media (two news
websites and one radio station), during a 9 month period (Table 6).


[Figure 7 flowchart; recoverable block labels: "Re-train main ASR System", "Enhanced main ASR System"]

Figure 7. The flowchart of using two complementary ASR systems to improve
overall performances
The two ASR systems must be designed to be complementary, so they
differ in many aspects:


ASR #1:
• The acoustic model is based on statistical models.
• The speech features are 13 mel-frequency cepstral coefficients (MFCC)
plus their first and second order derivatives.
• The vocabulary is about 64k words.
• The language model (LM) is based on 3-gram statistics.
• No rescoring of the LM.


ASR #2:
• The acoustic model is based on a time delay neural network (TDNN).
• The speech features are 40 MFCCs and 100-dimension iVectors.
• The vocabulary is 200k words.
• The LM is based on 2-gram statistics.
• Rescoring for the LM uses 4-gram statistics.

The results are not quite satisfactory: by roughly doubling the amount of training
speech data (adding 280 hours to the original 225 hours of speech), we obtained
only a 9% relative WER improvement. So, we concentrated on some other methods
to improve the accuracy of the ASR systems.
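
For readers less familiar with the term, a relative WER improvement is the WER reduction divided by the baseline WER. The short sketch below shows the arithmetic; the two WER values are hypothetical numbers chosen only to reproduce a 9% relative gain and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative improvement: WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

baseline_wer = 20.0    # hypothetical baseline WER, in percent
retrained_wer = 18.2   # hypothetical WER after retraining, in percent
gain = relative_wer_improvement(baseline_wer, retrained_wer)   # about 0.09
```

In this hypothetical case the absolute gain is 1.8 WER points, which corresponds to a 9% relative improvement.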


3.2. ASR Improvements 


In the last 5 years we targeted several directions to develop our large
vocabulary continuous speech recognition (LVCSR) system:

• Speech and text resources acquisition.

• Improved language models: larger vocabulary, higher n-gram orders for statistical
models.

• Improved acoustic models by switching from statistical models to deep
neural network models.

• Speech feature transforms.

• Lattice rescoring after speech decoding.


The acoustic model is now based on a Time Delay Neural Network (TDNN),
which is able to learn long-term temporal dependencies. As input we use 9
frames of relatively standard speech features: MFCCs and iVectors (especially
useful for speaker adaptation). The input layer has a couple of thousand neurons
and the output layer has a couple of hundred neurons. There are 3 to 6 hidden
layers with around 1200 neurons. The framework and algorithms used are available in
the Kaldi ASR toolkit. Some experimental results are shown in Table 7.
[Table 7: numeric results not recoverable; surviving configuration labels: 3500 input neurons, 350 output neurons, 6 hidden layers]


Table 7. Experimental results with TDNN used for the acoustic model


The Mel-frequency cepstral coefficients (MFCC) are extracted from 25 ms
signal windows, shifted by 10 ms. The final feature vector has the
dimension of 13 MFCCs x 9 frames. In addition, we used some feature
transforms:

• Cepstral mean and variance normalization (CMVN), in order to normalize the
mean and variance of raw cepstra and eliminate inter-speaker and environment
variations.

• Linear Discriminant Analysis (LDA), in order to reduce the feature space
dimension while keeping class discriminatory information.

• Maximum Likelihood Linear Transform (MLLT), to capture correlation
between the feature vector components.
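
The framing arithmetic, CMVN and the 9-frame stacking can be illustrated with a short NumPy sketch. It is not the Kaldi implementation. Random numbers stand in for real MFCCs, and per-utterance CMVN, a symmetric context of 4 frames on each side, and edge padding are assumptions; the function names are hypothetical.

```python
import numpy as np

def cmvn(features):
    """Per-utterance cepstral mean and variance normalization (assumed scope)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero
    return (features - mean) / std

def splice(features, context=4):
    """Stack each frame with `context` neighbours on each side (9 frames for context=4)."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    return np.hstack([padded[i:i + num_frames] for i in range(2 * context + 1)])

# Number of 25 ms windows, shifted by 10 ms, that fit in 1 second of audio
num_frames_1s = 1 + (1000 - 25) // 10        # 98

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))        # stand-in for 100 frames of 13 MFCCs
normalized = cmvn(mfcc)
spliced = splice(normalized)                 # shape (100, 117) = 13 MFCCs x 9 frames
```

The spliced vector has 13 x 9 = 117 components per frame; LDA would then reduce this dimension.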


The improvement of the language model (LM) has the following
characteristics:
• The Kaldi ASR toolkit was used because it allows using LMs with larger vocabularies
(more than 64k words).

o The text corpora used for language modeling were extended by
collecting new texts from the Internet. We point out that text collected
from the Internet needed diacritics restoration.

o We have about 315M word tokens (in 2017).

o Talk show transcriptions (40M word tokens) were already available.

• For the language models (LM):

o Statistical n-gram models are used, created by interpolating text
corpora with various weights.

o Various n-gram orders: from 1-gram to 5-gram.

o Various vocabulary sizes: 64k, 100k, 150k and 200k words.
The results of the language model evaluations, without rescoring, are shown
in Table 8. As expected, the best results are obtained for the 200k-word vocabulary
with 3-gram models.


[Table 8: numeric results not recoverable; surviving row labels: 100k words, 150k words, 200k words]


Table 8. LM evaluation without rescoring


When rescoring based on the algorithm presented above is used, the results
are slightly better, as can be seen in Table 9.


[Table 9: numeric results not reliably recoverable; surviving row labels: 100k words, 150k words, 200k words]


Table 9. LM evaluation with rescoring
The overall improvement is shown in Table 10. Several conclusions can be
pointed out:

• Several improvements of the Speech and Dialogue Laboratory (SpeeD) LVCSR
system for the Romanian language were presented.

• The application of feature transforms, discriminative training and speaker
adaptive training algorithms led to a lower WER.

• The use of DNN acoustic models is the most important change.

• Relative WER improvements range from 20.7% to 30.8% over the older statistical
models.

• Increasing the LM size and using lattice rescoring led to a lower WER.

• The overall relative WER improvement over the older system of about 5 years
ago is 70% on read speech and 48% on spontaneous speech.


Acoustic model (toolkit, year)    Language model
Statistic (CMU Sphinx, 2014)      64k words, 3-gram
Statistic (CMU Sphinx, 2017)      64k words, 3-gram
[illegible]                       64k words, 3-gram
[illegible]                       64k words, 3-gram
DNN (Kaldi, 2017)                 200k words, 2-gram; 4-gram (rescore)

[WER columns not recoverable]


Table 10. Overall improvements of our LVCSR system in the last 5 years